Improving statistical machine translation using morpho-syntactic information
Abstract
In the framework of statistical machine translation, correspondences between the words in the source and the target language are learned from bilingual corpora, and often little or no linguistic knowledge is used to structure the underlying models. The work presented in this thesis is motivated by the well-known observation that training data typically does not sufficiently represent the range of phenomena in natural languages. In this thesis, various methods of incorporating morphological and syntactic information into systems for statistical machine translation are proposed and systematically assessed. The overall goal is to improve translation quality and to reduce the amount of parallel text necessary to train the model parameters. The development of the suggested methods is guided by the analysis of important causes of errors.
Large differences in word order between corresponding sentences are difficult for automatic alignment algorithms to capture. In this work, a range of sentence-level restructuring transformations is introduced, motivated by knowledge about the sentence structure in the languages involved. These transformations aim to bring the word orders of corresponding sentences closer together. A detailed analysis of the effect on the corpora and the translation quality reveals that their application results in better alignments and, as a consequence, in less noisy probabilistic lexica, broader applicability of multi-word phrase pairs, and better coverage of the language model.
Existing statistical systems for machine translation often treat different inflected forms of the same lemma as if they were independent of each other. A better exploitation of the bilingual training data can be achieved by explicitly taking into account the interdependencies of the related inflected forms. In this work, a hierarchy of equivalence classes is defined on the basis of morphological and syntactic information beyond the surface forms. Features from those hierarchy levels are combined to form hierarchical lexicon models, which can replace the standard probabilistic lexicon used in most statistical machine translation systems. The benefit from these combined models is twofold. First, the lexical coverage is improved, because the translation of unseen word forms can be derived by considering information from lower levels in the hierarchy. Second, category ambiguity can be resolved, because syntactic context information is made locally accessible by means of annotation with morpho-syntactic tags.
Conventional bilingual dictionaries are often used as additional data to better train the model parameters. One of the disadvantages of these dictionaries as compared to full bilingual corpora is that their entries typically contain no context to enable the distinction between the translations for different readings of a word. In this work, a method is proposed for aligning corresponding readings in conventional dictionaries containing pairs of fully inflected word forms. The approach uses information deduced from one language side to resolve category ambiguity in the corresponding entry in the other language. The resulting disambiguated dictionaries are better suited for improving the quality of machine translation, especially if they are used in combination with the hierarchical lexicon models.
It is a costly and time-consuming task to gather large texts and have them translated to form bilingual corpora suitable for training the model parameters for statistical machine translation. In this work, the amount of bilingual data required to achieve an acceptable quality of machine translation is systematically investigated. All the methods presented in this thesis contribute to a better exploitation of the available bilingual data and thus to improving translation quality in frameworks with scarce resources. The combination of the suggested methods results in substantial improvements on the Verbmobil task, the Nespole! task and the Zeres task, for German to English and English to German translation, both for text input and for the output of a speech recognizer.
The second focus of this thesis is on the evaluation of machine translation quality. A tool for the evaluation of translation quality which accounts for the specific requirements of a research environment is developed. Evaluation criteria which are more adequate than pure edit distance are defined. The measurement along these quality criteria is performed semi-automatically in a fast, convenient and consistent way using the tool and the corresponding graphical user interface. The quality criteria themselves are systematically assessed.
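As a point of reference for the "pure edit distance" baseline mentioned above, the following is a minimal sketch of a word-level edit distance normalized by the reference length, commonly called word error rate. It is not taken from the thesis: the function name `word_error_rate`, the whitespace tokenization, and the normalization by reference length are assumptions made for illustration only, and the thesis's own evaluation criteria are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is a plain whitespace split,
    and every insertion, deletion and substitution costs 1.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the distance between the reference words processed so far
    # (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference word
                curr[j - 1] + 1,     # insert a hypothesis word
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # A reordered hypothesis is charged one insertion and one deletion.
    print(word_error_rate("I will go tomorrow", "tomorrow I will go"))
```

In this sketch a hypothesis that merely moves one word to a different position is still charged two edits, which illustrates one reason why edit distance alone can be a coarse measure of translation quality.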
Similar resources
Improving Word Alignment Quality using Morpho-syntactic Information
In this paper, we present an approach to include morpho-syntactic dependencies into the training of the statistical alignment models. Existing statistical translation systems usually treat different derivations of the same base form as if they were independent of each other. We propose a method which explicitly takes into account such interdependencies during the EM training of the statistical ali...
Morphology In Statistical Machine Translation From English To Highly Inflectional Language
In this paper, we investigate the role of morphology in phrase-based statistical machine translation (SMT) from English to the highly inflectional Slovenian language. Translation to an inflectional language is a challenging task because of its morphological complexity. Rich morphology increases data sparsity and worsens the quality of statistical machine translation. The idea of the paper is to...
Reduction of Morpho-Syntactic Features in Statistical Machine Translation of Highly Inflective Language
We address the problem of statistical machine translation from a highly inflective language to a less inflective one. The characteristics of inflective languages are generally not taken into account by the statistical machine translation system. Existing translation systems often treat different inflected word forms of the same lemma as if they were independent of each other, although some interdep...
Machine translation: statistical approach with additional linguistic knowledge
In this thesis, three possible aspects of using linguistic (i.e. morpho-syntactic) knowledge for statistical machine translation are described: the treatment of syntactic differences between source and target language using source POS tags, statistical machine translation with a small amount of bilingual training data, and automatic error analysis of translation output. Reorderings in the sourc...
Improving Phrase-Based SMT with Morpho-Syntactic Analysis and Transformation
This paper presents our study of exploiting morpho-syntactic information for phrase-based statistical machine translation (SMT). For morphological transformation, we use hand-crafted transformational rules. For syntactic transformation, we propose a transformational model based on Bayes’ formula. The model is trained using a bilingual corpus and a broad coverage parser of the source language. T...
Apptek Turkish-English machine translation system description for IWSLT 2009
In this paper, we describe the techniques that are explored in the AppTek system to enhance the translations in the Turkish to English track of IWSLT09. The submission was generated using a phrase-based statistical machine translation system. We also researched the usage of morpho-syntactic information and the application of word reordering in order to improve the translation results. The resul...